ETC5521 Tutorial 2

Introduction to exploratory data analysis

Author

Prof. Di Cook

Published

29 July 2024

🎯 Objectives

The purpose of this tutorial is to scope out the software available for doing EDA in R. We want to understand its capabilities and limitations.

🔧 Preparation

The reading for this week is The Landscape of R Packages for Automated Exploratory Data Analysis. This is a lovely summary of software available that is considered to do exploratory data analysis (EDA). (Note: Dr Cook considers these to be mostly descriptive statistics packages, not exploratory data analysis in the true spirit of the term.) This reading will be the basis of the tutorial exercises today.

  • Complete the weekly quiz, before the deadline!
  • Install this list of R packages, in addition to what you installed in the previous weeks:
install.packages(c("arsenal", "autoEDA", "DataExplorer", "dataMaid", "dlookr", "ExPanDaR", "explore", "exploreR", "funModeling", "inspectdf", "RtutoR", "SmartEDA", "summarytools", "visdat", "xray", "cranlogs", "tidyverse", "nycflights13"))
  • Open your RStudio project for this unit (the one you created in week 1, ETC5521). Create a .qmd document for this week's activities.

📥 Exercises

The article lists a number of R packages that might be used for EDA: arsenal, autoEDA, DataExplorer, dataMaid, dlookr, ExPanDaR, explore, exploreR, funModeling, inspectdf, RtutoR, SmartEDA, summarytools, visdat, xray.

1.

What package had the highest number of CRAN downloads as of 12.07.2019? (Based on the paper.)

summarytools, with 84,737 downloads.

2.

Open up the shiny server for checking download rates at https://hadley.shinyapps.io/cran-downloads/. Which of these packages has the highest download rate over the period Jan 1, 2024 to today?

library(cranlogs)
eda_pkgs <- cran_downloads(packages=c("arsenal", "autoEDA", "DataExplorer", "dataMaid", "dlookr", "ExPanDaR", "explore", "exploreR", "funModeling", "inspectdf", "RtutoR", "SmartEDA", "summarytools", "visdat", "xray"), from="2024-01-01", to=lubridate::today())
eda_pkgs |> 
  group_by(package) |>
  summarise(m=mean(count)) |>
  arrange(desc(m))
# A tibble: 15 × 2
   package           m
   <chr>         <dbl>
 1 summarytools 597.  
 2 SmartEDA     574.  
 3 visdat       487.  
 4 DataExplorer 364.  
 5 arsenal      222.  
 6 dlookr        97.3 
 7 explore       54.8 
 8 funModeling   45.7 
 9 inspectdf     44.0 
10 dataMaid      31.9 
11 ExPanDaR      16.9 
12 xray           9.79
13 exploreR       7.35
14 RtutoR         2.13
15 autoEDA        0   

SmartEDA was on a roll early in the year, but has virtually disappeared. summarytools is going strong. Interestingly, visdat, also high on the list, was developed by Nick Tierney during his years at Monash.

3.

What is an interesting pattern to observe from the time series plot of all the downloads?

The weekly seasonality! There is a regular up/down pattern that, if you zoom in closely (try plotting just a couple of weeks of data), you can see corresponds to weekdays vs weekends.
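A minimal sketch of zooming in, reusing the eda_pkgs data frame from question 2 (the fortnight chosen and the two packages shown are arbitrary):

```r
library(ggplot2)
# Plot two weeks of daily downloads; labelling the axis by weekday
# makes the weekday/weekend cycle obvious
eda_pkgs |>
  dplyr::filter(dplyr::between(date,
                               as.Date("2024-03-04"),
                               as.Date("2024-03-17")),
                package %in% c("summarytools", "visdat")) |>
  ggplot(aes(x = date, y = count, colour = package)) +
  geom_line() +
  scale_x_date(date_breaks = "1 day", date_labels = "%a")
```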

4.

How many functions does Staniak and Biecek (2019) say visdat has for doing EDA? Explore what each of them does, by running the example code for each function. What do you think are the features that make visdat a really popular package?

Six. It has a simple focus, with useful functions that apply to a lot of problems.

It gives an overview of the variable types, and missing values. This is useful to start a data analysis so that you can start working out what methods might be applied to each, and also whether some variables may have too many missing values to be analysable.

library(visdat)
# function 1
vis_dat(airquality)

# function 2
messy_vector <- c(TRUE,
                 "TRUE",
                 "T",
                 "01/01/01",
                 "01/01/2001",
                 NA,
                 NaN,
                 "NA",
                 "Na",
                 "na",
                 "10",
                 10,
                 "10.1",
                 10.1,
                 "abc",
                 "$%TG")
set.seed(1114)
messy_df <- data.frame(var1 = messy_vector,
                       var2 = sample(messy_vector),
                       var3 = sample(messy_vector))
vis_guess(messy_df)

# function 3
vis_miss(airquality)

# function 4
aq_diff <- airquality
aq_diff[1:10, 1:2] <- NA
vis_compare(airquality, aq_diff)

# function 5
dat_test <- tibble::tribble(
            ~x, ~y,
            -1,  "A",
            0,  "B",
            1,  "C",
            NA, NA
            )

vis_expect(dat_test, ~.x == -1)

# function 6
vis_cor(airquality)

5.

The package summarytools appears to be becoming more popular relative to visdat. Take a look at this package and explain what tools it has that are not available in visdat.

library(summarytools)
dfSummary(tobacco)
Data Frame Summary  
tobacco  
Dimensions: 1000 x 9  
Duplicates: 2  

-----------------------------------------------------------------------------------------------------------
No   Variable       Stats / Values         Freqs (% of Valid)    Graph                 Valid      Missing  
---- -------------- ---------------------- --------------------- --------------------- ---------- ---------
1    gender         1. F                   489 (50.0%)           IIIIIIIIII            978        22       
     [factor]       2. M                   489 (50.0%)           IIIIIIIIII            (97.8%)    (2.2%)   

2    age            Mean (sd) : 50 (18)    63 distinct values    .     .     . . . :   975        25       
     [numeric]      min < med < max:                             : : : : : . : : : :   (97.5%)    (2.5%)   
                    18 < 50 < 80                                 : : : : : : : : : :                       
                    IQR (CV) : 32 (0.4)                          : : : : : : : : : :                       
                                                                 : : : : : : : : : :                       

3    age.gr         1. 18-34               258 (26.5%)           IIIII                 975        25       
     [factor]       2. 35-50               241 (24.7%)           IIII                  (97.5%)    (2.5%)   
                    3. 51-70               317 (32.5%)           IIIIII                                    
                    4. 71 +                159 (16.3%)           III                                       

4    BMI            Mean (sd) : 26 (4.5)   974 distinct values             :           974        26       
     [numeric]      min < med < max:                                     : : :         (97.4%)    (2.6%)   
                    8.8 < 26 < 39                                        : : :                             
                    IQR (CV) : 5.7 (0.2)                               : : : : :                           
                                                                     . : : : : : .                         

5    smoker         1. Yes                 298 (29.8%)           IIIII                 1000       0        
     [factor]       2. No                  702 (70.2%)           IIIIIIIIIIIIII        (100.0%)   (0.0%)   

6    cigs.per.day   Mean (sd) : 6.8 (12)   37 distinct values    :                     965        35       
     [numeric]      min < med < max:                             :                     (96.5%)    (3.5%)   
                    0 < 0 < 40                                   :                                         
                    IQR (CV) : 11 (1.8)                          :                                         
                                                                 :   . . . . . .                           

7    diseased       1. Yes                 224 (22.4%)           IIII                  1000       0        
     [factor]       2. No                  776 (77.6%)           IIIIIIIIIIIIIII       (100.0%)   (0.0%)   

8    disease        1. Hypertension        36 (16.2%)            III                   222        778      
     [character]    2. Cancer              34 (15.3%)            III                   (22.2%)    (77.8%)  
                    3. Cholesterol         21 ( 9.5%)            I                                         
                    4. Heart               20 ( 9.0%)            I                                         
                    5. Pulmonary           20 ( 9.0%)            I                                         
                    6. Musculoskeletal     19 ( 8.6%)            I                                         
                    7. Diabetes            14 ( 6.3%)            I                                         
                    8. Hearing             14 ( 6.3%)            I                                         
                    9. Digestive           12 ( 5.4%)            I                                         
                    10. Hypotension        11 ( 5.0%)                                                      
                    [ 3 others ]           21 ( 9.5%)            I                                         

9    samp.wgts      Mean (sd) : 1 (0.1)    0.86!: 267 (26.7%)    IIIII                 1000       0        
     [numeric]      min < med < max:       1.04!: 249 (24.9%)    IIII                  (100.0%)   (0.0%)   
                    0.9 < 1 < 1.1          1.05!: 324 (32.4%)    IIIIII                                    
                    IQR (CV) : 0.2 (0.1)   1.06!: 160 (16.0%)    III                                       
                                           ! rounded                                                       
-----------------------------------------------------------------------------------------------------------

It has a lot more standard statistical summaries, and outputs them in a form that can be embedded nicely in a Quarto report.
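Beyond dfSummary(), the package also provides freq(), descr() and ctable(), none of which have a visdat counterpart. A sketch on the same tobacco data (output layout may vary by package version):

```r
library(summarytools)

# Frequency table with counts and percentages for one factor
freq(tobacco$smoker)

# Univariate numerical summaries (mean, sd, quartiles, skewness, ...)
descr(tobacco$BMI)

# Cross-tabulation of two categorical variables
ctable(tobacco$gender, tobacco$smoker)
```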

If you go to https://github.com/dcomtois/summarytools it's not very promising. There is no package website, only the GitHub repository, and there have been no updates for 8 months.

6.

Why do you think the package SmartEDA has gone out of favour?

The web site looks a little uncared for, with missing images on the main GitHub page. It has a nice package page, but again pieces are missing.

library(SmartEDA)
ExpData(data=mtcars,type=1)
                                          Descriptions     Value
1                                   Sample size (nrow)        32
2                              No. of variables (ncol)        11
3                    No. of numeric/interger variables        11
4                              No. of factor variables         0
5                                No. of text variables         0
6                             No. of logical variables         0
7                          No. of identifier variables         0
8                                No. of date variables         0
9             No. of zero variance variables (uniform)         0
10               %. of variables having complete cases 100% (11)
11   %. of variables having >0% and <50% missing cases    0% (0)
12 %. of variables having >=50% and <90% missing cases    0% (0)
13          %. of variables having >=90% missing cases    0% (0)
ExpData(data=mtcars,type=2)
   Index Variable_Name Variable_Type Sample_n Missing_Count Per_of_Missing
1      1           mpg       numeric       32             0              0
2      2           cyl       numeric       32             0              0
3      3          disp       numeric       32             0              0
4      4            hp       numeric       32             0              0
5      5          drat       numeric       32             0              0
6      6            wt       numeric       32             0              0
7      7          qsec       numeric       32             0              0
8      8            vs       numeric       32             0              0
9      9            am       numeric       32             0              0
10    10          gear       numeric       32             0              0
11    11          carb       numeric       32             0              0
   No_of_distinct_values
1                     25
2                      3
3                     27
4                     22
5                     22
6                     29
7                     30
8                      2
9                      2
10                     3
11                     6

It also has some standard summaries like summarytools, which is possibly why it became attractive. It can also be used to produce a detailed report on the data, although a report like the one in the next question took ages to complete.
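Besides ExpData(), two of those summary functions can be sketched on mtcars (arguments here are illustrative; check the help pages for the full set of options):

```r
library(SmartEDA)

# Summary statistics for all numeric variables, rounded to 2 dp;
# by = "A" requests summaries over the whole data set
ExpNumStat(mtcars, by = "A", round = 2)

# Density plots of the numeric variables, arranged 3 rows x 4 columns per page
ExpNumViz(mtcars, Page = c(3, 4))
```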

I’m not sure why its popularity has dropped. The documentation is not great, and there are quite a few spelling errors. The function names are (sort of) in CamelCase, which is not as readable as snake_case.

7.

SmartEDA::ExpReport() and DataExplorer::create_report() are functions that generate a long report when provided with a data set. Try the latter function on the nycflights13 data, using this code:

# DataExplorer
library(DataExplorer)
library(nycflights13)
library(tidyverse)

# Create a big data set
airlines_all <- flights |> 
  full_join(airlines, by = "carrier") |>
  full_join(planes, by = "tailnum", 
            suffix = c("_flights", "_planes")) |>
  full_join(airports, by = c("origin"="faa"), 
            suffix = c("_carrier", "_origin")) |>
  full_join(airports, by = c("dest"="faa"), 
            suffix = c("_origin", "_dest"))
create_report(airlines_all, y = "arr_delay")

The code specifies arr_delay as a “response variable”. Give some reasons why this report is not very useful!

There is a lot of information to wade through, and many of the plots and tables are essentially unreadable.

  1. Have your Generative AI assistant suggest what to look at when it is given the variable summary, from:
glimpse(airlines_all)

Here’s what I got from Claude:

Based on the dataset you've provided, which appears to be about airline flights, I can suggest several plots and summaries that would be informative. Here are some recommendations:

Time-based analysis:

Line plot of average delays (dep_delay and arr_delay) by month and year
Heatmap of delays by day of week and hour of day
Box plots of air_time by month


Carrier analysis:

Bar plot of flight counts by carrier
Box plots of delays by carrier
Scatter plot of departure delay vs. arrival delay, colored by carrier


Route analysis:

Bar plot of most common routes (origin-dest pairs)
Map visualization of routes with line thickness indicating frequency


Distance analysis:

Scatter plot of distance vs. air_time
Histogram of flight distances


Aircraft analysis:

Bar plot of most common aircraft models
Box plot of delays by aircraft manufacturer


Airport analysis:

Bar plot of busiest airports (by departure and arrival)
Map of airports with point size indicating traffic volume


Summaries:

Table of average delays by carrier
Summary statistics (mean, median, std dev) for air_time and distance
Correlation matrix of numeric variables


Time series:

Time series decomposition of daily flight counts


Categorical data:

Stacked bar charts for engine types by manufacturer


Performance metrics:

Scatter plot of scheduled arrival time vs. actual arrival time

It’s nicely organised by the different types of analysis activities that might be done. Each of these requires different ways to aggregate the data, to make summaries or plots by various categorical variables. But it doesn’t suggest the pre-processing that might be necessary, so I asked it what might be needed:

Before creating visualizations and summaries, some pre-processing steps would be beneficial for this dataset. Here are some recommendations:

Handle missing values:

Check for NA or null values, especially in columns like speed, air_time, and delay fields.
Decide whether to impute missing values or remove rows with missing data, depending on the extent and nature of missingness.


Date and time formatting:

Combine year_flights, month, and day into a single date column.
Ensure time_hour is properly formatted as a datetime object.


Categorical data:

Convert appropriate columns to factors (e.g., carrier, origin, dest, manufacturer).
Check for and handle any inconsistent spellings or codings in categorical variables.


Delay calculations:

Create a total_delay column by summing dep_delay and arr_delay.
Consider creating a binary column for whether a flight was delayed (e.g., over 15 minutes).


Outlier detection:

Check for and possibly handle extreme values in delay times, air_time, and distance.


Feature engineering:

Create a day_of_week column from the date.
Calculate flight speed from distance and air_time.
Create a season column based on month.


Data type conversions:

Ensure numeric columns are of the appropriate type (int or float).
Convert time columns (dep_time, arr_time, etc.) to a more usable format if they're not already.


Geocoding:

Ensure lat and lon columns for both origin and destination are numeric and contain valid coordinates.


Consistency checks:

Verify that arr_time is always later than dep_time.
Check that air_time is consistent with the difference between arr_time and dep_time.


Merging data:

If the planes and airports data are in separate datasets, merge them carefully with the main flights data.


Creating subsets:

Consider creating smaller datasets for specific analyses (e.g., by year, by carrier) to improve processing speed.

This is quite useful!
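A few of these suggestions can be implemented directly with dplyr and lubridate. A sketch, using the airlines_all data built earlier (the 15-minute delay cutoff is just Claude's suggestion, and airlines_all_clean is a hypothetical name):

```r
library(tidyverse)
library(lubridate)

airlines_all_clean <- airlines_all |>
  mutate(
    date = make_date(year_flights, month, day),  # combine year/month/day into a date
    day_of_week = wday(date, label = TRUE),      # feature engineering: weekday
    delayed = dep_delay > 15,                    # binary flag: delayed over 15 min
    speed_mph = distance / (air_time / 60)       # compute speed from distance and time
  )
```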

8.

In a limited fashion, let’s work through some areas suggested by Claude.

  1. Use visdat to examine the variable types and missing values. You’ll need to take a sample of the data because there are too many observations to reasonably plot. However, a sample should give reasonable insight on the reliability of most variables. What variables may not be useful because they have too many missing values?
vis_dat(airlines_all[sample(1:nrow(airlines_all), 1000),])

speed is mostly missing. It definitely cannot be used.

type, manufacturer, model, engine, year_planes, engines and seats have a lot of missing values, and so may be unusable. It is possible that these missings occur for some types of planes, while other types have more complete records.
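One way to check this on the full data, rather than a sample, is to compute the proportion missing for each plane-related variable and see whether the missings all coincide (a sketch):

```r
library(dplyr)

# Proportion missing for a few plane-related variables
airlines_all |>
  summarise(across(c(speed, seats, engine, model),
                   ~ mean(is.na(.x))))

# If the missings coincide, they likely correspond to flights whose
# tailnum is simply absent from the planes table
airlines_all |>
  count(missing_plane_info = is.na(type))
```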

For the rest of these questions, you decide how to process the data, make summaries or plots to provide answers.

  1. Which carrier had the most flights?
  2. Is this the same for each month? Or day of the week?
  3. Are there more departure delays for flights in the morning hours, or evening hours?
  4. Find an error in the data, e.g. a flight that arrived before it left.
  5. With your neighbour in the tutorial come up with one thing that is a bit surprising to you that you can learn from this data. Make sure you state what you expected to see, and why what you saw was then a surprise. (It is possible that you can use the DataExplorer report to look at something you had not thought to examine, as motivation.)
airlines_all |> count(carrier, sort=TRUE)
# A tibble: 17 × 2
   carrier     n
   <chr>   <int>
 1 UA      58665
 2 B6      54635
 3 EV      54173
 4 DL      48110
 5 AA      32729
 6 MQ      26397
 7 US      20536
 8 9E      18460
 9 WN      12275
10 VX       5162
11 FL       3260
12 <NA>     2812
13 AS        714
14 F9        685
15 YV        601
16 HA        342
17 OO         32
airlines_all |> 
  filter(carrier %in% c("UA", "B6", "EV", "DL")) |>
  mutate(carrier = factor(carrier, levels = c("UA", "B6", "EV", "DL"))) |>
  count(month, carrier) |>
  mutate(month = factor(month, levels=1:12)) |>
  pivot_wider(names_from = carrier, 
              values_from = n, 
              values_fill=0)
# A tibble: 12 × 5
   month    UA    B6    EV    DL
   <fct> <int> <int> <int> <int>
 1 1      4637  4427  4171  3690
 2 2      4346  4103  3827  3444
 3 3      4971  4772  4726  4189
 4 4      5047  4517  4561  4092
 5 5      4960  4576  4817  4082
 6 6      4975  4622  4456  4126
 7 7      5066  4984  4641  4251
 8 8      5124  4952  4563  4318
 9 9      4694  4291  4725  3883
10 10     5060  4361  4908  4093
11 11     4854  4289  4471  3849
12 12     4931  4741  4307  4093
airlines_all |> 
  filter(carrier %in% c("UA", "B6", "EV", "DL")) |>
  mutate(carrier = factor(carrier, levels = c("UA", "B6", "EV", "DL"))) |>
  count(month, carrier) |>
  ggplot(aes(x=month, y=n, colour=carrier)) + 
    geom_line() +
    scale_x_continuous("", breaks = 1:12)

airlines_all |>
  group_by(hour) |>
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  ggplot(aes(x=hour, y=dep_delay)) +
    geom_point() +
    geom_smooth(se=F)

airlines_all |>
  ggplot() +
    geom_abline(intercept=0, slope=1, colour="red") +
    geom_point(aes(x=dep_delay, y=arr_delay)) +
    theme(aspect.ratio=1)

airlines_all |>
  ggplot() +
    geom_abline(intercept=0, slope=1, colour="red") +
    geom_point(aes(x=dep_time, y=arr_time)) +
    theme(aspect.ratio=1)

airlines_all |> 
  filter(origin == dest)
# A tibble: 0 × 42
# ℹ 42 variables: year_flights <int>, month <int>, day <int>, dep_time <int>,
#   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, name_carrier <chr>,
#   year_planes <int>, type <chr>, manufacturer <chr>, model <chr>,
#   engines <int>, seats <int>, speed <int>, engine <chr>, name_origin <chr>, …
  1. UA, United Airlines
  2. Ooh, in September EV had more flights than UA!
  3. Definitely evening hours.
  4. There are some flights that leave substantially early, maybe not a mistake but worrying for a traveller.
  5. What did you find?
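The flights that leave substantially early (from answer 4) can be pulled out directly. A sketch, with an arbitrary cutoff of 30 minutes early:

```r
library(dplyr)

# Flights departing more than 30 minutes ahead of schedule
airlines_all |>
  filter(dep_delay < -30) |>
  select(carrier, flight, origin, dest, dep_delay) |>
  arrange(dep_delay)
```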

9.

Table 2 of the Landscape paper summarises the activities of two early phases of the CRISP-DM standard. What does CRISP-DM mean? The implication is that EDA is related to “data understanding” and “data preparation”. Would you agree with this or disagree? Why?

Cross-Industry Standard Process for Data Mining; EDA techniques can be useful for some parts of these stages, for example finding outliers, or examining missing value patterns. Some of these steps are important for effective EDA, too, for example, you need to know what types of variables you have in order to decide what types of plots to make.

10.

Table 1 of the paper, summarising CRAN downloads and GitHub activity, is hard to read. How are the rows sorted? What is the most important information communicated by the table? In what way(s) might revising this table make it easier to read and digest the most important information?

The rows appear to be sorted alphabetically by package name. The most important information is the download rate, because the purpose is to know which packages are commonly used. Sorting the rows by downloads instead would make the table easier to read and digest.

👌 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.